Abstract

We compute explicitly several abstract metrics for RNA secondary structures defined 
by Reidys and Stadler. 
1 Introduction

As is well known, an RNA molecule can be viewed as a chain of (ribo)nucleotides with a
definite orientation. Each of these nucleotides is characterized by (and in practice identified 
with) the base attached to it, which can be adenine (A), cytosine (C), guanine (G), or uracil 
(U). Thus, an RNA molecule with N nucleotides can be mathematically described as a word 
of length N over the alphabet {A, C, G, U}, called the primary structure of the molecule. 

In the cell and in vitro each RNA molecule folds into a three-dimensional structure, which 
determines its biochemical function. This structure is held together by weak interactions called 
hydrogen bonds between pairs of non-consecutive bases: actually, a hydrogen bond can only 
form between bases that are several positions apart in the chain, but we shall not take this 
restriction into account here. Most of these bonds form between Watson-Crick complementary 
bases, i.e., between A and U and between C and G, but a significant number of bonds also form
between other pairs of bases [9]. The secondary structure of an RNA molecule is a simplified
model of this three-dimensional structure: an undirected graph whose nodes are the bases of
the molecule and whose arcs are its base pairs, or contacts. The length of a secondary structure
is the number of its nodes. A restriction is added to the definition of secondary structure:
a base can pair with at most one other base. This restriction is called the unique bonds condition.

An important problem in molecular biology is the comparison of these RNA secondary 
structures, because it is assumed that a preserved three-dimensional structure corresponds 
to a preserved function. Moreover, the comparison of RNA secondary structures of a fixed 
length is used in the prediction of RNA secondary structures to reduce the output of alternate
structures when suboptimal solutions, and not only optimal ones, are considered [10, §IX]. In a
seminal paper on the algebraic representation of biomolecular structures [7], C. Reidys and P.
F. Stadler introduced three abstract metrics on the set of RNA secondary structures of a fixed
length based on their algebraic models and independent of any notion of graph editing, and
they discussed their biophysical relevance. They ended that paper by asking, among other
questions, whether there exists any relation between the metrics for RNA secondary structures 
they had defined. In this paper we answer this question by explicitly computing these metrics. 
In a subsequent paper 0] we plan to generalize these metrics to contact structures without 
unique bonds, as for instance protein structures. 

2 Main results 

From now on, let [n] denote the set {1, . . . , n}, for every positive integer n. 

Definition 1 An RNA secondary structure of length $n$ is an undirected graph without multiple
edges or self-loops $\Gamma = ([n], Q)$, for some $n \geq 1$, whose arcs $\{j, k\} \in Q$, called contacts, satisfy
the following two conditions:

i) For every $j \in [n]$, $\{j, j+1\} \notin Q$.

ii) For every $j \in [n]$, if $\{j, k\}, \{j, l\} \in Q$, then $k = l$.

Condition (i) translates the impossibility of a contact between two consecutive bases, 
while condition (ii) translates the unique bonds condition. We should point out that this 
definition of RNA secondary structure is not the usual one, as the latter forbids the existence 
of (pseudo)knots: pairs of contacts $\{i, j\}$ and $\{k, l\}$ such that $i < k < j < l$. This rather
unnatural condition is usually required in order to enable the use of dynamic programming
methods to predict RNA secondary structures [10], but real secondary structures can contain
knots and thus we shall not impose this restriction here. Therefore, our RNA secondary
structures correspond to what in the literature on secondary structure modelling has been
called contact structures with unique bonds [7, 8] or 1-diagrams [2].
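
The two conditions of Definition 1 are easy to check mechanically. The following Python sketch is only an illustration: the function name `is_secondary_structure` and the encoding of a structure as a length `n` together with a collection of pairs `(j, k)` are assumptions of this example, not notation from the paper. Note that pseudoknots are deliberately accepted, as in the definition above.

```python
def is_secondary_structure(n, contacts):
    """Check conditions (i) and (ii) of Definition 1 for contacts on nodes 1..n."""
    pairs = {frozenset(c) for c in contacts}  # a contact {j, k} is unordered
    paired = set()  # nodes already involved in some contact
    for pair in pairs:
        if len(pair) != 2:
            return False  # self-loops are not allowed
        j, k = pair
        if not (1 <= j <= n and 1 <= k <= n):
            return False  # both ends must be nodes of [n]
        if abs(j - k) == 1:
            return False  # condition (i): consecutive bases cannot pair
        if j in paired or k in paired:
            return False  # condition (ii): unique bonds
        paired.update(pair)
    return True


# The contacts 1-4 and 2-5 form a pseudoknot (1 < 2 < 4 < 5), which is allowed here.
print(is_secondary_structure(6, [(1, 4), (2, 5)]))
# Node 3 would be bonded twice, which violates the unique bonds condition.
print(is_secondary_structure(6, [(1, 3), (3, 5)]))
```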

We shall denote from now on a contact {j, k} by j-k or k-j, without distinction. A node 
is said to be isolated in an RNA secondary structure when it is not involved in any contact. 

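Let $\mathbb{S}_n$ stand for the set of all RNA secondary structures of length $n$ and let $\mathcal{S}_n$ be the
symmetric group of permutations of $[n]$.

Definition 2 For every $\Gamma = ([n], Q) \in \mathbb{S}_n$, say with $Q = \{i_1\text{-}j_1, \ldots, i_k\text{-}j_k\}$, let

$$\pi(\Gamma) = \prod_{t=1}^{k} (i_t, j_t),$$

where $(i, j)$ denotes the transposition in $\mathcal{S}_n$ defined by $i \leftrightarrow j$.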

Reidys and Stadler proved in [7] that the mapping $\pi : \mathbb{S}_n \to \mathcal{S}_n$ is injective and that
$\pi(\Gamma)$ is an involution for every $\Gamma \in \mathbb{S}_n$. This representation of RNA secondary structures as
involutions is then used by these authors to define the following metric, called the involution
metric.
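
As a small illustration of Definition 2, the sketch below builds the permutation associated with a structure and checks that it is an involution. The names `involution` and `compose`, and the representation of a permutation as a Python dictionary on {1, ..., n}, are choices made for this example only.

```python
def involution(n, contacts):
    """Return the product of the disjoint transpositions (j, k), one per contact j-k."""
    perm = {i: i for i in range(1, n + 1)}  # isolated nodes stay fixed
    for j, k in contacts:
        perm[j], perm[k] = k, j  # the contact j-k swaps j and k
    return perm


def compose(p, q):
    """Return the permutation p o q (apply q first, then p)."""
    return {i: p[q[i]] for i in q}


pi = involution(5, [(1, 3), (2, 5)])
identity = {i: i for i in range(1, 6)}
# Disjoint transpositions commute and square to the identity, so pi o pi = Id.
print(compose(pi, pi) == identity)
```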
Proposition 1 The mapping $d_{inv} : \mathbb{S}_n \times \mathbb{S}_n \to \mathbb{R}$ sending every $(\Gamma_1, \Gamma_2) \in \mathbb{S}_n \times \mathbb{S}_n$ to the least
number $d_{inv}(\Gamma_1, \Gamma_2)$ of transpositions necessary to represent the permutation $\pi(\Gamma_1)\pi(\Gamma_2)$, is a
metric.

The following proposition computes this metric explicitly. In it, and henceforth, $A \Delta B$
denotes the symmetric difference $(A \cup B) - (A \cap B)$ of the sets $A$ and $B$, and $|A|$ stands for
the cardinal of the finite set $A$.

Proposition 2 For every $\Gamma_1 = ([n], Q_1), \Gamma_2 = ([n], Q_2) \in \mathbb{S}_n$,

$$d_{inv}(\Gamma_1, \Gamma_2) = |Q_1 \Delta Q_2| - 2\Omega,$$

where $\Omega$ is the number of cyclic orbits of length greater than 2 induced by the action on $[n]$ of
the subgroup $\langle \pi(\Gamma_1), \pi(\Gamma_2) \rangle$ of $\mathcal{S}_n$.
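
The involution metric can also be evaluated directly from its definition. The sketch below relies on the standard fact that the least number of transpositions needed to write a permutation of [n] equals n minus its number of cycles, fixed points included. The function names and the encoding of contacts as pairs are assumptions of this illustration.

```python
def involution(n, contacts):
    """Permutation of {1..n} swapping the two ends of every contact."""
    perm = {i: i for i in range(1, n + 1)}
    for j, k in contacts:
        perm[j], perm[k] = k, j
    return perm


def d_inv(n, q1, q2):
    """Least number of transpositions representing the product of the two involutions."""
    p1, p2 = involution(n, q1), involution(n, q2)
    product = {i: p1[p2[i]] for i in range(1, n + 1)}
    seen, cycles = set(), 0
    for start in range(1, n + 1):
        if start in seen:
            continue
        cycles += 1  # a new cycle of the product starts here
        i = start
        while i not in seen:
            seen.add(i)
            i = product[i]
    return n - cycles


# A cyclic orbit on the nodes 1, 3, 5, 7: four contacts in the symmetric difference.
print(d_inv(7, [(1, 3), (5, 7)], [(3, 5), (1, 7)]))
# A linear orbit on the nodes 1, 3, 5: two contacts in the symmetric difference.
print(d_inv(5, [(1, 3)], [(3, 5)]))
```

In the first call the symmetric difference has 4 contacts and there is one cyclic orbit of length 4, so the formula of Proposition 2 predicts 4 - 2 = 2. In the second call there is no cyclic orbit and the prediction is 2.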

Proof. Let $\Gamma_1 = ([n], Q_1)$ and $\Gamma_2 = ([n], Q_2)$ be two RNA secondary structures of length $n$.
To simplify the language, we shall refer to the orbits induced by the action of $\langle \pi(\Gamma_1), \pi(\Gamma_2) \rangle$ on
$[n]$ simply by orbits. Notice that we can understand such an orbit as a subset $\{i_1, i_2, \ldots, i_m\}$
of $[n]$, $m \geq 1$, such that

$$i_1\text{-}i_2,\ i_2\text{-}i_3,\ \ldots,\ i_{m-1}\text{-}i_m \in Q_1 \cup Q_2$$

and maximal with this property, i.e., such that any other contact in $Q_1 \cup Q_2$ involving $i_1$ or $i_m$
can only be $i_1\text{-}i_m$. The unique bonds condition (or, in group-theoretical terms, the fact that
the transpositions defining each $\pi(\Gamma_i)$ are pairwise disjoint) implies that if $\{i_1, i_2, \ldots, i_m\}$ is
an orbit, then either

$$i_1\text{-}i_2,\ i_3\text{-}i_4,\ \ldots \in Q_1 \quad \text{and} \quad i_2\text{-}i_3,\ i_4\text{-}i_5,\ \ldots \in Q_2$$

or

$$i_1\text{-}i_2,\ i_3\text{-}i_4,\ \ldots \in Q_2 \quad \text{and} \quad i_2\text{-}i_3,\ i_4\text{-}i_5,\ \ldots \in Q_1.$$

Such an orbit is cyclic if $m = 2$ and $i_1\text{-}i_2 \in Q_1 \cap Q_2$, or $m \geq 3$ and $i_1\text{-}i_m \in Q_1 \cup Q_2$, and an
orbit is linear in all other cases. The fact that $\pi(\Gamma_1), \pi(\Gamma_2)$ are both involutions implies that
the cardinal of cyclic orbits is always even: roughly speaking, if $i_1\text{-}i_2 \in Q_1$ in a cyclic orbit,
then $i_1\text{-}i_m \in Q_2$ and hence $i_{m-1}\text{-}i_m \in Q_1$.

If two transpositions appearing in the product $\pi(\Gamma_1)\pi(\Gamma_2)$ are not disjoint, then the
indexes involved in them belong to the same orbit. Moreover, two disjoint transpositions
always commute. This allows us to reorganize the transpositions in the product $\pi(\Gamma_1)\pi(\Gamma_2)$,
assembling them into subproducts corresponding to orbits. More specifically, if for every orbit
$O$ and for every $i = 1, 2$ we let

$$\pi(O, \Gamma_i) = \prod_{\substack{k\text{-}l \in Q_i \\ k, l \in O}} (k, l),$$

then

$$\pi(\Gamma_1)\pi(\Gamma_2) = \prod_{O \in \{\text{orbits}\}} \pi(O, \Gamma_1)\pi(O, \Gamma_2).$$

Since the orbits are pairwise disjoint, this finally shows that the least number of transpositions
which $\pi(\Gamma_1)\pi(\Gamma_2)$ decomposes into is equal to the sum of the least numbers of transpositions
which $\pi(O, \Gamma_1)\pi(O, \Gamma_2)$ decompose into, for every orbit $O$. It remains to compute this last
number for each type of orbit $O$.

If $O$ is a linear orbit of length $m = 1$, then $\pi(O, \Gamma_1)\pi(O, \Gamma_2) = \mathrm{Id}$, and it corresponds to
a node that is isolated both in $\Gamma_1$ and in $\Gamma_2$.

Let now $O = \{i_1, \ldots, i_m\}$ be a linear orbit of length $m \geq 2$. Consider first the case when
$i_1\text{-}i_2, i_3\text{-}i_4, \ldots, i_{m-1}\text{-}i_m \in Q_1$ and $i_2\text{-}i_3, \ldots, i_{m-2}\text{-}i_{m-1} \in Q_2$; in particular, $m$ is even. Then

$$\begin{aligned}
\pi(O, \Gamma_1)\pi(O, \Gamma_2) &= (i_1, i_2)(i_3, i_4) \cdots (i_{m-1}, i_m)(i_2, i_3) \cdots (i_{m-2}, i_{m-1}) \\
&= (i_2, i_4, \ldots, i_m, i_{m-1}, i_{m-3}, \ldots, i_3, i_1),
\end{aligned}$$

a cycle of length $m$ that decomposes into the product of $m - 1$ transpositions (and it is the least
number of transpositions required to represent it), which is exactly the number of contacts of
$Q_1 \cup Q_2$ involved in this orbit.

A similar argument shows that in all other cases for a linear orbit $O$, the permutation
$\pi(O, \Gamma_1)\pi(O, \Gamma_2)$ is equal to a cycle of length the number of elements of the orbit, and thus
the least number of transpositions this product decomposes into is equal to the number of
contacts of $Q_1 \cup Q_2$ involved in this orbit $O$, all of them belonging to $Q_1 \Delta Q_2$.

If $O$ is a cyclic orbit of length $m = 2$, say $O = \{i_1, i_2\}$, then $\pi(O, \Gamma_1)\pi(O, \Gamma_2) =
(i_1, i_2)(i_1, i_2) = \mathrm{Id}$. Notice that cyclic orbits of length 2 correspond to contacts in $Q_1 \cap Q_2$.

Finally, assume that $O$ is a cyclic orbit of length $m \geq 3$, say $O = \{i_1, \ldots, i_m\}$ with
$i_1\text{-}i_2, i_3\text{-}i_4, \ldots, i_{m-1}\text{-}i_m \in Q_1$ and $i_2\text{-}i_3, \ldots, i_{m-2}\text{-}i_{m-1}, i_m\text{-}i_1 \in Q_2$; remember that $m$ is in this
case even. Then

$$\begin{aligned}
\pi(O, \Gamma_1)\pi(O, \Gamma_2) &= (i_1, i_2)(i_3, i_4) \cdots (i_{m-1}, i_m)(i_2, i_3) \cdots (i_{m-2}, i_{m-1})(i_m, i_1) \\
&= (i_2, i_4, \ldots, i_m)(i_{m-1}, i_{m-3}, \ldots, i_3, i_1),
\end{aligned}$$

the product of two disjoint cycles of length $m/2$. Since each cycle requires $m/2 - 1$ transpositions,
the least number of transpositions the permutation $\pi(O, \Gamma_1)\pi(O, \Gamma_2)$ decomposes into
is equal to $m - 2$, the number of contacts of $Q_1 \cup Q_2$ involved in this orbit $O$ (all of them
belonging again to $Q_1 \Delta Q_2$) minus 2.

To sum up, and if we call $\Omega$ the number of cyclic orbits of length greater than 2,

$$\begin{aligned}
d_{inv}(\Gamma_1, \Gamma_2) &= |\{\text{contacts involved in linear orbits}\}| \\
&\quad + |\{\text{contacts involved in cyclic orbits of length greater than 2}\}| - 2\Omega \\
&= |Q_1 \Delta Q_2| - 2\Omega,
\end{aligned}$$

as we claimed. ■
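
The orbit analysis of the proof can be replayed numerically. The following sketch, whose function names and data encoding are hypothetical, finds the orbits as connected components of the contacts of both structures, counts the cyclic orbits of length greater than 2 (those with more than two nodes in which every node is paired in both structures), and compares the closed formula with a direct cycle count of the product of the two involutions.

```python
def partners(contacts):
    """Map every paired node to its partner (well defined by unique bonds)."""
    nb = {}
    for j, k in contacts:
        nb[j], nb[k] = k, j
    return nb


def long_cyclic_orbits(n, q1, q2):
    """Count the cyclic orbits with more than 2 nodes."""
    nb1, nb2 = partners(q1), partners(q2)
    seen, omega = set(), 0
    for start in range(1, n + 1):
        if start in seen:
            continue
        orbit, stack = set(), [start]
        while stack:  # follow the contacts of either structure
            i = stack.pop()
            if i not in orbit:
                orbit.add(i)
                stack.extend(nb[i] for nb in (nb1, nb2) if i in nb)
        seen |= orbit
        # In a cyclic orbit no node is an end point: all are paired in both structures.
        if len(orbit) > 2 and all(i in nb1 and i in nb2 for i in orbit):
            omega += 1
    return omega


def d_inv_formula(n, q1, q2):
    """Size of the symmetric difference of the contact sets minus twice Omega."""
    sym_diff = {frozenset(c) for c in q1} ^ {frozenset(c) for c in q2}
    return len(sym_diff) - 2 * long_cyclic_orbits(n, q1, q2)


def d_inv_direct(n, q1, q2):
    """n minus the number of cycles of the product of the two involutions."""
    p1, p2 = partners(q1), partners(q2)
    product = {}
    for i in range(1, n + 1):
        x = p2.get(i, i)  # apply the involution of the second structure first
        product[i] = p1.get(x, x)
    seen, cycles = set(), 0
    for start in range(1, n + 1):
        if start not in seen:
            cycles += 1
            i = start
            while i not in seen:
                seen.add(i)
                i = product[i]
    return n - cycles


q1 = [(1, 3), (5, 7), (2, 9)]
q2 = [(3, 5), (1, 7), (2, 9)]
print(d_inv_formula(9, q1, q2), d_inv_direct(9, q1, q2))
```

In this example the shared contact 2-9 forms a cyclic orbit of length 2 and contributes nothing, while the nodes 1, 3, 5, 7 form one cyclic orbit of length 4.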

The number and structure of the orbits induced by the action of $\langle \pi(\Gamma_1), \pi(\Gamma_2) \rangle$ on $[n]$ are
related to the probability of transition from the neutral network of $\Gamma_1$ (the set of sequences
that fold into it) to that of $\Gamma_2$: see [3] and the references cited therein.

Let now $\mathrm{Sub}(\mathcal{S}_n)$ be the set of subgroups of $\mathcal{S}_n$.

Definition 3 For every $\Gamma = ([n], Q) \in \mathbb{S}_n$, say with $Q = \{i_1\text{-}j_1, \ldots, i_k\text{-}j_k\}$, let

$$T(\Gamma) = \{(i_1, j_1), \ldots, (i_k, j_k)\}$$

be the set of the transpositions corresponding to the contacts in $Q$ and let $G(\Gamma) = \langle T(\Gamma) \rangle$ be
the subgroup of $\mathcal{S}_n$ generated by this set of transpositions.



Reidys and Stadler also proved in [7] that the mapping $G : \mathbb{S}_n \to \mathrm{Sub}(\mathcal{S}_n)$ is injective, and
then they used this representation of RNA secondary structures as permutation subgroups to
define the following subgroup metric.

Proposition 3 The mapping $d_{sgr} : \mathbb{S}_n \times \mathbb{S}_n \to \mathbb{R}$ defined by

$$d_{sgr}(\Gamma_1, \Gamma_2) = \ln \frac{|G(\Gamma_1) \cdot G(\Gamma_2)|}{|G(\Gamma_1) \cap G(\Gamma_2)|}$$

is a metric.



The next proposition shows that this metric simply measures, up to a constant factor, the
cardinal of the symmetric difference of the sets of contacts.

Proposition 4 For every $\Gamma_1 = ([n], Q_1), \Gamma_2 = ([n], Q_2) \in \mathbb{S}_n$,

$$d_{sgr}(\Gamma_1, \Gamma_2) = (\ln 2)\,|Q_1 \Delta Q_2|.$$

Proof. Since the transpositions generating a group $G(\Gamma)$, with $\Gamma \in \mathbb{S}_n$, are pairwise disjoint,
there is a bijection between $G(\Gamma)$ and the powerset $\mathcal{P}(T(\Gamma))$: each element of $G(\Gamma)$ is the
product of a subset of $T(\Gamma)$ in a unique way. Hence, $|G(\Gamma_1)| = 2^{|Q_1|}$ and $|G(\Gamma_2)| = 2^{|Q_2|}$.

On the other hand, by the uniqueness of the decomposition of a permutation into a
product of disjoint cycles, a permutation belongs to $G(\Gamma_1) \cap G(\Gamma_2)$ if and only if it is a
product of transpositions belonging to both $G(\Gamma_1)$ and $G(\Gamma_2)$. Therefore,

$$G(\Gamma_1) \cap G(\Gamma_2) = \langle T(\Gamma_1) \cap T(\Gamma_2) \rangle = \langle (i, j) \mid i\text{-}j \in Q_1 \cap Q_2 \rangle,$$

and then, arguing as in the previous paragraph, we see that $|G(\Gamma_1) \cap G(\Gamma_2)| = 2^{|Q_1 \cap Q_2|}$.
Now, it is well known that

$$|G(\Gamma_1) \cdot G(\Gamma_2)| = \frac{|G(\Gamma_1)| \cdot |G(\Gamma_2)|}{|G(\Gamma_1) \cap G(\Gamma_2)|},$$

and hence

$$d_{sgr}(\Gamma_1, \Gamma_2) = \ln \frac{|G(\Gamma_1)| \cdot |G(\Gamma_2)|}{|G(\Gamma_1) \cap G(\Gamma_2)|^2}
= \ln 2^{|Q_1| + |Q_2| - 2|Q_1 \cap Q_2|} = \ln 2^{|Q_1 \Delta Q_2|} = (\ln 2)\,|Q_1 \Delta Q_2|,$$

as we claimed. ■

Notice in particular that, had Reidys and Stadler defined their subgroup metric
as $\log_2(|G(\Gamma_1) \cdot G(\Gamma_2)| / |G(\Gamma_1) \cap G(\Gamma_2)|)$, it would coincide with $|Q_1 \Delta Q_2|$.
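
For small structures the subgroup metric can be checked by brute force. The sketch below enumerates the elements of each subgroup as the products of the subsets of its generating transpositions (the bijection used in the proof above), forms the product set and the intersection explicitly, and evaluates the logarithm. The names `group` and `d_sgr` and the tuple encoding of permutations are assumptions of this example, which is meant for a handful of contacts only, since the groups have 2^|Q| elements.

```python
import math
from itertools import combinations


def group(n, contacts):
    """All elements of the subgroup generated by the transpositions of the contacts."""
    contacts = list(contacts)
    elements = set()
    for size in range(len(contacts) + 1):
        for subset in combinations(contacts, size):
            perm = list(range(n + 1))  # position 0 is unused padding
            for j, k in subset:
                perm[j], perm[k] = k, j  # disjoint transpositions, order is irrelevant
            elements.add(tuple(perm))
    return elements


def d_sgr(n, q1, q2):
    """ln(|G1 . G2| / |G1 n G2|), computed from explicit element sets."""
    g1, g2 = group(n, q1), group(n, q2)
    product_set = {tuple(a[b[i]] for i in range(n + 1)) for a in g1 for b in g2}
    return math.log(len(product_set) / len(g1 & g2))


q1, q2 = [(1, 3), (5, 7)], [(3, 5), (1, 7)]
# The symmetric difference has 4 contacts, so both numbers should agree with 4 ln 2.
print(d_sgr(7, q1, q2), 4 * math.log(2))
```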

The third metric on $\mathbb{S}_n$ proposed by Reidys and Stadler is actually a general way of
defining metrics, rather than a single one, and it uses Magarshak and coworkers' algebraic 
representation of RNA secondary structures El El , recently extended in to cope with 
contacts other than Watson-Crick complementary base pairs. These authors represent an RNA 
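secondary structure $\Gamma = ([n], Q)$ as an $n \times n$ complex symmetric matrix $S_\Gamma = (s_{i,j})_{i,j=1,\ldots,n}$
where

$$s_{i,j} = \begin{cases}
-1 & \text{if } i \neq j \text{ and } i\text{-}j \in Q \\
1 & \text{if } i = j \text{ and } i\text{-}l \notin Q \text{ for every } l \\
0 & \text{otherwise}
\end{cases}$$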
Since $S_\Gamma^{-1} = S_\Gamma$ for every $\Gamma \in \mathbb{S}_n$, one can define for any $\Gamma_1, \Gamma_2 \in \mathbb{S}_n$ the transfer matrix
$T_{\Gamma_1,\Gamma_2} = S_{\Gamma_2} \circ S_{\Gamma_1}^{-1}$. Then, Reidys and Stadler propose to measure the difference between two
RNA secondary structures by defining a metric through

$$(\Gamma_1, \Gamma_2) \mapsto \|T_{\Gamma_1,\Gamma_2}\|,$$

where $\|\cdot\|$ stands for some length function on the group $GL(n, \mathbb{C})$ of $n \times n$ invertible complex
matrices Def. 9, Lem. 6] (actually, Reidys and Stadler propose to use a matrix norm || • ||, 
but it is probably a misprint, as it would not yield a metric). A simple and well-known length 
function on GL(n, C) is 

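$$\|A\| = \operatorname{rank}(A - \mathrm{Id}),$$

which allows us to define a metric on $\mathbb{S}_n$:

$$d_{mag}(\Gamma_1, \Gamma_2) = \operatorname{rank}(T_{\Gamma_1,\Gamma_2} - \mathrm{Id}).$$

This metric turns out to be equal to the involution metric $d_{inv}$ defined above.

Proposition 5 For every $\Gamma_1, \Gamma_2 \in \mathbb{S}_n$, $d_{mag}(\Gamma_1, \Gamma_2) = d_{inv}(\Gamma_1, \Gamma_2)$.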

The proof of this proposition is similar to (and simpler than) the proof of ^ Thm. 17], 
which establishes essentially this equality for the generalized algebraic representation of RNA 
secondary structures in the sense of Magarshak introduced in that paper, and therefore we 
omit it. 
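
Although the proof is omitted, the equality in Proposition 5 can be checked on examples. The sketch below assumes that the structure matrix has entries -1 at the positions of the contacts and 1 on the diagonal at isolated nodes, and it computes the transfer matrix as the product of the matrix of the second structure with that of the first, using the fact that each structure matrix is its own inverse. The rank is obtained by exact Gaussian elimination over the rationals. All function names are hypothetical.

```python
from fractions import Fraction


def structure_matrix(n, contacts):
    """Matrix with -1 at paired positions and 1 on the diagonal at isolated nodes."""
    s = [[0] * n for _ in range(n)]
    paired = set()
    for j, k in contacts:
        s[j - 1][k - 1] = s[k - 1][j - 1] = -1
        paired.update((j, k))
    for i in range(1, n + 1):
        if i not in paired:
            s[i - 1][i - 1] = 1
    return s


def rank(matrix):
    """Rank by Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in matrix]
    rk = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rk, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rk + 1, len(m)):
            factor = m[r][col] / m[rk][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rk])]
        rk += 1
    return rk


def d_mag(n, q1, q2):
    """rank(T - Id), where T is the matrix of q2 times the matrix of q1."""
    s1, s2 = structure_matrix(n, q1), structure_matrix(n, q2)
    t = [[sum(s2[i][k] * s1[k][j] for k in range(n)) for j in range(n)]
         for i in range(n)]
    diff = [[t[i][j] - (1 if i == j else 0) for j in range(n)] for i in range(n)]
    return rank(diff)


# One cyclic orbit of length 4: the involution metric equals 4 - 2 = 2 here.
print(d_mag(7, [(1, 3), (5, 7)], [(3, 5), (1, 7)]))
```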